1. Introduction

Heart disease remains the leading cause of death globally, accounting for millions of deaths annually. The ability to accurately predict heart disease can lead to early diagnosis and intervention, significantly reducing the mortality rate associated with this condition. This project aims to leverage data science and machine learning techniques to analyze and predict the likelihood of heart attacks based on a wide range of factors including age, sex, chest pain type, resting blood pressure, serum cholesterol, and more.

Utilizing a dataset from Kaggle, this analysis will explore various features that may influence heart disease, implement several machine learning models to predict heart disease occurrence, and evaluate the performance of these models. The dataset includes diverse variables, offering a comprehensive insight into the factors that might contribute to heart disease.

Through this project, we aim to uncover significant predictors of heart disease, assess the predictive power of machine learning models in a medical context, and ultimately, contribute to the efforts of preventing heart disease by enabling early detection. This endeavor is not only a technical challenge but also a crucial step towards saving lives and improving health outcomes.

As we delve into the data and models, our goal is to present a clear, concise, and informative analysis that can serve as a foundation for further research and practical applications in the field of medical science and public health.

The Heart Attack Analysis & Prediction Dataset on Kaggle includes data relevant for analyzing and predicting heart attacks. To access this dataset, you can visit the Kaggle dataset page.

1.1 Examining the Project Topic

What Is a Heart Attack?

  • The medical name for a heart attack is “myocardial infarction”.
  • In short, a heart attack is the occlusion of a coronary vessel by plaque-like lesions filled with cholesterol and fat.
  • A lesion is an abnormal change that disease causes in the tissue of the affected organ.
  • As a result of the blockage, blood flow is completely cut off, and a heart attack that can lead to death occurs.

How does a heart attack occur?

  • The heart is like an efficient pump, tirelessly working to circulate blood throughout the body about 60 to 80 times a minute when we’re at rest. Just like any other part of the body, the heart itself needs a steady supply of nutrients and oxygen, which it gets through its very own set of blood vessels known as coronary arteries.

  • Sometimes, these crucial arteries can have trouble with their blood flow due to blockages or narrowing, a condition known as coronary insufficiency. This issue can vary widely – it all depends on where the blockage is, how severe it is, and the specific arteries affected.

  • For some, this might just mean chest pain that pops up during a workout or some physical task, but goes away once they take a break. However, in more serious cases, if a coronary artery gets suddenly completely blocked, it can trigger a heart attack. This often starts with intense chest pain and, in the worst-case scenario, can lead to sudden death.

1.2 Recognizing Variables in the Dataset

Variable definitions in the Dataset

  • Age: Age of the patient
  • Sex: Sex of the patient
  • exang: exercise induced angina (1 = yes; 0 = no)
  • ca: number of major vessels (0-3)
  • cp: chest pain type
    • Value 0: asymptomatic
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
  • trtbps: resting blood pressure (in mm Hg)
  • chol: cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalachh: maximum heart rate achieved
  • target: 0 = less chance of heart attack; 1 = more chance of heart attack

Additional variable descriptions to help us

  1. age - age in years

  2. sex - sex (1 = male; 0 = female)

  3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 0 = asymptomatic)

  4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)

  5. chol - serum cholesterol in mg/dl

  6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

  7. restecg - resting electrocardiographic results (1 = normal; 2 = having ST-T wave abnormality; 0 = hypertrophy)

  8. thalach - maximum heart rate achieved

  9. exang - exercise induced angina (1 = yes; 0 = no)

  10. oldpeak - ST depression induced by exercise relative to rest

  11. slope - the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)

  12. ca - number of major vessels (0-3) colored by flourosopy

  13. thal - 2 = normal; 1 = fixed defect; 3 = reversible defect

  14. num - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)
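Once the dataset is loaded (Section 2.2), the coded categorical variables above can be relabeled so plots and summaries show readable category names. A minimal, hypothetical helper sketch, assuming the codings listed above (sex: 1 = male, 0 = female; cp: 0 = asymptomatic, 1 = typical, 2 = atypical, 3 = non-anginal; exang: 1 = yes, 0 = no):

```r
library(dplyr)

# Hypothetical helper: turn coded columns into labeled factors so that
# legends and summaries are self-explanatory (mapping follows the list above).
label_heart_data <- function(df) {
  df %>%
    mutate(
      sex  = factor(sex, levels = c(0, 1), labels = c("female", "male")),
      cp   = factor(cp, levels = c(0, 1, 2, 3),
                    labels = c("asymptomatic", "typical angina",
                               "atypical angina", "non-anginal pain")),
      exng = factor(exng, levels = c(0, 1), labels = c("no", "yes"))
    )
}
```

Applying this to the data frame before plotting would make the pie-chart legends in Section 4 self-explanatory; we keep the numeric coding in the analysis below to match the original dataset.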

2. First Organization

2.1 Required R Libraries

2.1.1 Basic Libraries

library(corrplot)   # for the correlation plot
library(discrim)    # for discriminant analysis models
library(corrr)      # for calculating correlations
library(knitr)      # to help with the knitting process
library(MASS)       # classical statistics functions (e.g., LDA)
library(tidyverse)  # using tidyverse and tidymodels for most of this project
library(tidymodels)
library(ggplot2)    # for most of our visualizations
library(ggrepel)    # for non-overlapping text labels
library(rpart.plot) # for visualizing trees
library(vip)        # for variable importance
library(janitor)    # for cleaning our data
library(ranger)     # engine for our random forest
library(dplyr)      # for data manipulation
library(yardstick)  # for measuring performance metrics
library(naniar)     # for visualizing missing data (vis_miss)
library(kableExtra) # for styled tables
library(kknn)       # engine for k-nearest neighbors
tidymodels_prefer()

2.2 Loading The Dataset

Heart_data <- read_csv("data/heart.csv", show_col_types = FALSE)
Heart_data
## # A tibble: 303 × 14
##      age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
##  1    63     1     3    145   233     1       0      150     0     2.3     0
##  2    37     1     2    130   250     0       1      187     0     3.5     0
##  3    41     0     1    130   204     0       0      172     0     1.4     2
##  4    56     1     1    120   236     0       1      178     0     0.8     2
##  5    57     0     0    120   354     0       1      163     1     0.6     2
##  6    57     1     0    140   192     0       1      148     0     0.4     1
##  7    56     0     1    140   294     0       0      153     0     1.3     1
##  8    44     1     1    120   263     0       1      173     0     0       2
##  9    52     1     2    172   199     1       1      162     0     0.5     2
## 10    57     1     2    150   168     0       1      174     0     1.6     2
## # ℹ 293 more rows
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>

3. Preparation for Exploratory Data Analysis (EDA)

3.1 Examining Missing Values

vis_miss(Heart_data)

This plot shows us at a glance that there is no missing data in the whole dataset.

3.2 Examining Unique Values

unique_number <- sapply(Heart_data, function(x) length(unique(x)))
unique_values_df <- data.frame("Total Unique Values" = unique_number)
rownames(unique_values_df) <- names(Heart_data)
print(unique_values_df)
##          Total.Unique.Values
## age                       41
## sex                        2
## cp                         4
## trtbps                    49
## chol                     152
## fbs                        2
## restecg                    3
## thalachh                  91
## exng                       2
## oldpeak                   40
## slp                        3
## caa                        5
## thall                      4
## output                     2

Analysis Outputs (1)

  • Based on the unique-value counts above, we treat variables with few unique values as categorical and variables with many unique values as numeric.
  • In this context, Numeric Variables: “age”, “trtbps”, “chol”, “thalachh”, and “oldpeak”
  • Categorical Variables: “sex”, “cp”, “fbs”, “restecg”, “exng”, “slp”, “caa”, “thall”, “output”
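The same split can be derived programmatically rather than by eye. A minimal sketch; the threshold of 10 unique values is our own assumption, chosen because it reproduces the counts listed above:

```r
# Classify variables by their number of unique values: more than 10 unique
# values -> numeric, otherwise categorical. The cutoff of 10 is an arbitrary
# choice that matches the unique-value table above.
unique_counts <- sapply(Heart_data, function(x) length(unique(x)))
numeric_var   <- names(unique_counts)[unique_counts > 10]
categoric_var <- names(unique_counts)[unique_counts <= 10]
```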

3.3 Separating variables (Numeric or Categorical)

numeric_var <- c("age", "trtbps", "chol", "thalachh", "oldpeak")

categoric_var <- c("sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall", "output")

Heart_data %>%
  select(numeric_var) %>%
  summary()
##       age            trtbps           chol          thalachh        oldpeak    
##  Min.   :29.00   Min.   : 94.0   Min.   :126.0   Min.   : 71.0   Min.   :0.00  
##  1st Qu.:47.50   1st Qu.:120.0   1st Qu.:211.0   1st Qu.:133.5   1st Qu.:0.00  
##  Median :55.00   Median :130.0   Median :240.0   Median :153.0   Median :0.80  
##  Mean   :54.37   Mean   :131.6   Mean   :246.3   Mean   :149.6   Mean   :1.04  
##  3rd Qu.:61.00   3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:166.0   3rd Qu.:1.60  
##  Max.   :77.00   Max.   :200.0   Max.   :564.0   Max.   :202.0   Max.   :6.20

4. Exploratory Data Analysis (EDA)

4.1 Variable Analysis

4.1.1 Numerical Variables (Analysis with Histograms)

ggplot(Heart_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  ggtitle("Age Distribution") +
  xlab("Age") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = trtbps)) +
  geom_histogram(binwidth = 5, fill = "red", color = "black") +
  ggtitle("Resting Blood Pressure (in mm Hg) Distribution") +
  xlab("Resting blood pressure (in mm Hg)") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = chol)) +
  geom_histogram(binwidth = 5, fill = "purple", color = "black") +
  ggtitle("Cholesterol Distribution") +
  xlab("Cholesterol (mg/dl)") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = thalachh)) +
  geom_histogram(binwidth = 5, fill = "green", color = "black") +
  ggtitle("Maximum Heart Rate Achieved Distribution") +
  xlab("Maximum heart rate achieved") +
  ylab("Frequency")

ggplot(Heart_data, aes(x = oldpeak)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  ggtitle("ST Depression Induced by Exercise Relative to Rest Distribution") +
  xlab("ST depression induced by exercise relative to rest") +
  ylab("Frequency")

Analysis Outputs (1)

Age Variable
  • The vast majority of patients are between 50 and 60 years old.
  • One region of the chart stands out: there is a dip in the number of patients between the ages of 47 and 50.
  • There appear to be no outliers in this variable.
Trtbps Variable
  • The resting blood pressure of most patients is between 110 and 140.
  • Values above 180 can be considered outliers.
  • There are noticeable concentrations of patients in the 115-120, 125-130, and 155-160 ranges.
Cholesterol Variable
  • The cholesterol value of most patients is between 200 and 280.
  • Values above 380 can be considered outliers.
Thalachh Variable
  • The maximum heart rate achieved by most patients is between 145 and 170.
  • Values below 80, in particular, can be considered outliers.
Oldpeak Variable
  • For the vast majority of patients, values range from 0 to 1.5.
  • Values above 2.5, especially, can be considered outliers.
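The visually chosen cutoffs above can be cross-checked against the conventional 1.5 × IQR rule. A minimal sketch; the fences it produces will differ somewhat from the eyeballed thresholds:

```r
# Upper outlier fence (Q3 + 1.5 * IQR) for each numeric variable; values
# above the fence are conventional outlier candidates.
upper_fence <- function(x) quantile(x, 0.75) + 1.5 * IQR(x)
sapply(Heart_data[numeric_var], upper_fence)
```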

4.1.2 Categorical Variables (Analysis with Pie Chart)

categoric_var
## [1] "sex"     "cp"      "fbs"     "restecg" "exng"    "slp"     "caa"    
## [8] "thall"   "output"
heart_data_summary_sex <- Heart_data %>%
  count(sex) %>%
  mutate(perc = n / sum(n))

ggplot(heart_data_summary_sex, aes(x = "", y = n, fill = factor(sex))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("#FF5733", "#33B5FF")) +
  labs(title = "sex (Gender)", fill = "sex") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))   

heart_data_summary_cp <- Heart_data %>%
  count(cp) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#33B5FF", "#CDDC39", "#9C27B0")

ggplot(heart_data_summary_cp, aes(x = "", y = n, fill = factor(cp))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "cp (Chest Pain type )", fill = "cp") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_fbs <- Heart_data %>%
  count(fbs) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#33B5FF")



ggplot(heart_data_summary_fbs, aes(x = "", y = n, fill = factor(fbs))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "fbs (fasting blood sugar )", fill = "fbs") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_restecg <- Heart_data %>%
  count(restecg) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#33B5FF", "#CDDC39")



ggplot(heart_data_summary_restecg, aes(x = "", y = n, fill = factor(restecg))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "restecg (resting electrocardiographic results)", fill = "restecg") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_exng <- Heart_data %>%
  count(exng) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#CDDC39")



ggplot(heart_data_summary_exng, aes(x = "", y = n, fill = factor(exng))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "exng (exercise induced angina)", fill = "exng") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_slp <- Heart_data %>%
  count(slp) %>%
  mutate(perc = n / sum(n))

colors <- c("#FF5733", "#CDDC39", "#9C27B0")



ggplot(heart_data_summary_slp, aes(x = "", y = n, fill = factor(slp))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = colors) +
  labs(title = "slp (the slope of the peak exercise ST segment)", fill = "slp") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_caa <- Heart_data %>%
  count(caa) %>%
  mutate(perc = n / sum(n))


ggplot(heart_data_summary_caa, aes(x = "", y = n, fill = factor(caa))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  labs(title = "caa (number of major vessels)", fill = "caa") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_thall <- Heart_data %>%
  count(thall) %>%
  mutate(perc = n / sum(n))


ggplot(heart_data_summary_thall, aes(x = "", y = n, fill = factor(thall))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  labs(title = "thall (Thallium stress test)", fill = "thall") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

heart_data_summary_output <- Heart_data %>%
  count(output) %>%
  mutate(perc = n / sum(n))


ggplot(heart_data_summary_output, aes(x = "", y = n, fill = factor(output))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
  labs(title = "output (diagnosis of heart disease)", fill = "output") +
  theme_void() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
        legend.text = element_text(color = "darkblue", size = 13, face = "bold"))

4.1.2.1 Analysis Outputs (2)

Sex Variable

  • 68% of the patients are male and 32% are female.
  • So the number of male patients is more than twice the number of female patients.

Cp Variable

  • Almost half of the patients have an observation value of 0; in other words, they are asymptomatic.
  • Asymptomatic here means coronary disease without the classic chest-pain symptoms.
  • Looking at the other half of the pie chart, about 1 in 4 patients has an observation value of 2.
  • In other words, atypical angina is seen in 29% of the patients.
  • This observation value covers patients with shortness of breath or non-classical pain.
  • The remaining two observation values are less common.
  • 16.5% of patients have a value of 1, i.e., typical angina: the classic exertion pain that appears during physical activity.
  • The other 8% have value 3, non-anginal pain, the third type of chest pain in this coding.
  • Non-anginal pain describes chest pain that is not caused by heart disease or a heart attack.

Fbs Variable

  • The vast majority of patients, about 85%, have an observation value of 0.
  • The fasting blood sugar of these patients is at most 120 mg/dl.
  • The remaining 15 percent (value 1) have a fasting blood glucose level above 120 mg/dl.

Restecg Variable

  • What stands out in this variable’s chart is that the number of patients with observation value 2 is negligible: about 1.3 percent of all patients.
  • This value represents patients with an ST-T wave abnormality.
  • Another notable point is that the counts for observation values 1 and 0 are almost equal.
  • The share of patients with value 1 (the orange slice) is 50.2%; these patients’ resting electrocardiographic results are normal.
  • The share of patients with value 0 is 48.5%; under the coding used here, these patients show signs of left ventricular hypertrophy.

Exang Variable

  • As noted, this variable stands for exercise-induced angina.
  • Angina is chest pain caused by insufficient blood flow through the coronary arteries that feed the heart.
  • In the “exang” variable, a value of 1 means the pain occurs with exercise, and 0 means it does not.
  • Value 0 occurs more than twice as often as value 1: more than half of the patients do not have exercise-induced angina.

Slp Variable

  • The least common observation value is 0, at about 7 percent.
  • These are patients whose peak-exercise ST segment slopes downward.
  • The other two observation values are almost equal.
  • For half of the remaining patients the value is 1 (a flat ST segment), while for the other half it is 2 (an upward-sloping ST segment).

Caa Variable

  • This variable is the number of major vessels colored by fluoroscopy.
  • In more than half of the patients, 57.8 percent, the value is 0: no major vessels were colored by fluoroscopy.
  • After 0, the value with the next-largest slice in the pie chart is 1.
  • One major vessel is observed in 21.5% of the patients.
  • This suggests that in the majority of patients the vessels are occluded, so the major vessels cannot be visualized with the fluoroscopy technique.

Thall Variable

  • The “thall” variable is short for the “thallium stress test”.
  • The thallium stress test is an imaging method that evaluates how much blood reaches the heart muscle and helps determine whether a person has coronary artery disease.
  • The variable description lists three observation values, but the pie chart shows four: 0, 1, 2, and 3.
  • Based on our research, the observation value 0 represents missing data. Therefore, in a later step, the 0 values will be treated as null and filled with a sensible value.
  • According to the thallium stress test results, 54.8 percent of the patients have observation value 2, i.e., a normal test result.
  • 36.8 percent have a value of 3, which corresponds to a reversible defect.
  • 5.9 percent of patients have a value of 1, i.e., a fixed defect.

Output Variable

  • More than half of the patients, 54.5 percent, have a higher risk of heart attack. The remaining 45.5 percent have a lower risk.

4.1.2.2 Examining the Missing Data According to the Analysis Result

thal_zero_rows <- Heart_data %>% filter(thall == 0)
thal_zero_rows
## # A tibble: 2 × 14
##     age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
## 1    53     0     2    128   216     0       0      115     0       0     2
## 2    52     1     0    128   204     1       1      156     1       1     1
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
# 0 will be filled with 2, the most common value in thall
Heart_data$thall[Heart_data$thall == 0] <- 2

unique_thal_categories <- unique(Heart_data$thall)
unique_thal_categories
## [1] 1 2 3
vis_miss(Heart_data)

4.2.3 Numerical Variables - Categorical Variables (Analysis with Box Plot)

Heart_data_long <- Heart_data %>%
  gather(key = "variables", value = "value", trtbps, chol, thalachh)

ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(sex))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "Sex") 

ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(cp))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "Cp") 

ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(output))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "output") 

Heart_data_long_2 <- Heart_data %>%
  gather(key = "variables", value = "value",age)

ggplot(Heart_data_long_2, aes(x = variables, y = value, fill = factor(output))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "output") 

Heart_data_long_3 <- Heart_data %>%
  gather(key = "variables", value = "value",oldpeak)

ggplot(Heart_data_long_3, aes(x = variables, y = value, fill = factor(output))) +
  geom_boxplot() +
  labs(title = "Numerical Variables - Categorical Variables (Box Plot)",
       x = "Variables",
       y = "Value",
       fill = "output")

4.2.3.1 Analysis Outputs (3)

Sex - Numeric Variables
  • There is no strong correlation between “sex” and the numerical variables; the relationship with all of them is weak.
  • When the boxes for the observation values of the sex variable are examined, they are difficult to distinguish from each other.
Output - Numeric Variables
  • For the “oldpeak” variable, the median of the blue box falls outside the red box. This indicates a stronger relationship between “oldpeak” and the target than for the other numerical variables; we can call it a moderate correlation.
  • There is also a correlation between “thalachh” and the output variable. Again, the median of the blue box falls outside the red box, which indicates a relatively strong relationship.
  • The other three variables do not show much correlation with the output.

4.3 Relationships Between Variables

cor_mat <- cor(Heart_data)
corrplot(cor_mat, method = "color", 
         addCoef.col = "black",   
         tl.cex = 0.8,            
         number.cex = 0.6,        
         cl.cex = 0.8,            
         tl.col = "black",        
         tl.srt = 45,             
         order = "hclust" )            

4.3.1 Analysis Outputs (4)

Age Variable
  • The variable with the highest correlation with “age” is “thalachh”. There is a negative correlation between them that we can call moderate.
  • The correlation is -0.40; in other words, there is an inverse relationship between “age” and “thalachh”.
Trtbps Variable
  • The variable with the highest correlation with “trtbps” is “age”, at 0.28.
  • This is a weak positive correlation.
Chol Variable
  • The variable with the highest correlation with “chol” is “age”.
  • The correlation is 0.21, a weak positive correlation.
  • So we can say that as age increases, cholesterol also tends to increase.
Thalachh Variable
  • The variable with the highest correlation with “thalachh” is “output”.
  • The correlation between them is 0.42, positive and moderate. In other words, it is a variable closely associated with heart attack risk.
  • This variable also correlates with several other variables.
  • This suggests that the maximum heart rate achieved may itself be influenced by other variables.
Oldpeak Variable
  • It has the largest correlation magnitude in this table: -0.58 with the “slp” variable.
  • This is a negative correlation, slightly above moderate intensity.
Sex Variable
  • There is no strong correlation between “sex” and the other variables.
Fbs Variable
  • The “fbs” variable generally does not correlate with the other variables.
  • Its highest correlation, 0.18, is with “trtbps”: a weak positive correlation.
Restecg Variable
  • There is no strong correlation between “restecg” and the other variables.
  • Its highest correlation is 0.14, with “output”: a weak positive correlation.
Output Variable
  • The “output” variable correlates with more than one variable.
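The per-variable readings above can also be extracted directly from the correlation matrix. A minimal sketch ranking each variable by the absolute value of its correlation with the target; this assumes Heart_data is still entirely numeric at this point:

```r
# Absolute correlation of every variable with the target, largest first;
# variables near the bottom are candidates for dropping in Section 5.1.
cor_with_output <- cor(Heart_data)[, "output"]
sort(abs(cor_with_output), decreasing = TRUE)
```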

5. Preparation for Modeling

5.1 Dropping Columns with Low Correlation

Heart_data <- Heart_data %>% select(-c(chol, fbs, restecg))
Heart_data
## # A tibble: 303 × 11
##      age   sex    cp trtbps thalachh  exng oldpeak   slp   caa thall output
##    <dbl> <dbl> <dbl>  <dbl>    <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1    63     1     3    145      150     0     2.3     0     0     1      1
##  2    37     1     2    130      187     0     3.5     0     0     2      1
##  3    41     0     1    130      172     0     1.4     2     0     2      1
##  4    56     1     1    120      178     0     0.8     2     0     2      1
##  5    57     0     0    120      163     1     0.6     2     0     2      1
##  6    57     1     0    140      148     0     0.4     1     0     1      1
##  7    56     0     1    140      153     0     1.3     1     0     2      1
##  8    44     1     1    120      173     0     0       2     0     3      1
##  9    52     1     2    172      162     0     0.5     2     0     3      1
## 10    57     1     2    150      174     0     1.6     2     0     2      1
## # ℹ 293 more rows

5.2 Data Splitting

Now we’re going to walk through the process of splitting the Heart_data data set into two, a training set and a test set.

set.seed(1234)
heart_split <- initial_split(Heart_data, prop = 0.80,
                                strata = output)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

5.3 Creating a Recipe

Heart_data$output <- as.factor(Heart_data$output)
heart_test$output <- as.factor(heart_test$output)
#make recipe
heart_train$output <- as.factor(heart_train$output)
heart_recipe <- recipe(output ~., data=heart_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes())
prep(heart_recipe) %>% 
  bake(new_data = heart_train) %>% 
  head() %>% 
  kable() %>% 
  kable_styling(full_width = F) 
age sex cp trtbps thalachh exng oldpeak slp caa thall output
1.3490876 0.6494454 -0.9444162 1.7141174 -1.7727510 1.4069260 0.3741999 -0.7014764 2.1413378 -0.520390 0
1.3490876 0.6494454 -0.9444162 -0.6572784 -0.8778349 1.4069260 1.3051534 -0.7014764 1.1870031 1.204738 0
0.8209750 -1.5334127 -0.9444162 0.5284195 0.4432318 -0.7078324 2.1514748 -2.3657636 1.1870031 -0.520390 0
0.9265975 0.6494454 -0.9444162 -0.0644295 -0.1107639 -0.7078324 0.2895678 -0.7014764 0.2326684 1.204738 0
-0.1296276 0.6494454 -0.9444162 0.5284195 0.2301565 1.4069260 1.7283141 -2.3657636 -0.7216663 1.204738 0
0.1872399 0.6494454 1.0006793 -0.0644295 -0.3238392 1.4069260 -0.3874893 -0.7014764 0.2326684 -2.245518 0

5.4 5-fold Cross-Validation

heart_folds <- vfold_cv(heart_train, v = 5)

6. Modelling

6.1 k-Nearest Neighbors

knn_spec <- nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")
# tune neighbors from 1 to 10
neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)

knn_workflow <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(heart_recipe)
knn_tune_res_class <- tune_grid(
  object = knn_workflow, 
  resamples = heart_folds, 
  grid = neighbors_grid
)
# For k-NN model
knn_metrics_class <- knn_tune_res_class %>% 
  collect_metrics()
knn_auc <- knn_metrics_class %>% filter(.metric == "roc_auc") # Filter for AUC metric

# Print AUC results
print(knn_auc)
## # A tibble: 10 × 7
##    neighbors .metric .estimator  mean     n std_err .config              
##        <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1         1 roc_auc binary     0.801     5  0.0219 Preprocessor1_Model01
##  2         2 roc_auc binary     0.857     5  0.0151 Preprocessor1_Model02
##  3         3 roc_auc binary     0.876     5  0.0148 Preprocessor1_Model03
##  4         4 roc_auc binary     0.882     5  0.0145 Preprocessor1_Model04
##  5         5 roc_auc binary     0.884     5  0.0157 Preprocessor1_Model05
##  6         6 roc_auc binary     0.887     5  0.0148 Preprocessor1_Model06
##  7         7 roc_auc binary     0.888     5  0.0138 Preprocessor1_Model07
##  8         8 roc_auc binary     0.891     5  0.0120 Preprocessor1_Model08
##  9         9 roc_auc binary     0.893     5  0.0124 Preprocessor1_Model09
## 10        10 roc_auc binary     0.893     5  0.0130 Preprocessor1_Model10

Among the tuned values, the model with 10 neighbors (neighbors = 10) has the highest mean AUC (0.893, essentially tied with 9 neighbors). Its standard error is also relatively low (0.0130), which suggests that the performance is stable across the five folds. Based on these results, Model 10 (with 10 neighbors) appears to perform best in terms of area under the ROC curve: it has the highest mean AUC and a low standard error, indicating strong and stable predictive performance.

best_neighbors_class <- select_by_one_std_err(knn_tune_res_class, desc(neighbors), metric = "roc_auc")
final_wf_knn_class <- finalize_workflow(knn_workflow, best_neighbors_class)

final_fit_knn_class <- fit(final_wf_knn_class, heart_train)

augment(final_fit_knn_class, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.934

The output tibble shows a ROC AUC of 0.934 on the test set, indicating that the model also has strong classification ability on unseen data. Notably, the test-set ROC AUC is even higher than the cross-validation average, which is encouraging; still, such a gap can also reflect a favourable random split, so the possibility of an overly optimistic evaluation or data leakage should be kept in mind.

Overall, this result should provide a high level of confidence, indicating that the model is likely to perform well in practical applications.
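This test-set performance can also be visualized with the same `yardstick` ROC-curve pattern used for the logistic regression model (a sketch, assuming `final_fit_knn_class` and `heart_test` are available):

```r
# ROC curve for the tuned k-NN model on the held-out test set
augment(final_fit_knn_class, new_data = heart_test) %>%
  roc_curve(output, .pred_0) %>%
  autoplot()
```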

We can calculate accuracies on the training data:

knn_acc_train <- augment(final_fit_knn_class, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
knn_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.913

We can calculate accuracies on the testing data:

knn_acc_test <- augment(final_fit_knn_class, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
knn_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.836

The model’s accuracy is approximately 91.3% on the training data and around 83.6% on the testing data. These figures indicate that the k-NN model performs quite well, although the gap between training and testing accuracy points to some degree of overfitting.
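Accuracy alone can hide asymmetric errors, which matter in a medical setting where missing a true case is costlier than a false alarm. A sketch of class-level test metrics for the fitted k-NN model, derived from the `yardstick` confusion-matrix summary (assumes the objects above):

```r
# Sensitivity, specificity and F1 on the test set, computed from the confusion matrix
augment(final_fit_knn_class, new_data = heart_test) %>%
  conf_mat(truth = output, estimate = .pred_class) %>%
  summary() %>%
  filter(.metric %in% c("sens", "spec", "f_meas"))
```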

6.2 Logistic Regression

log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")
log_wkflow <- workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(heart_recipe)

log_fit_results <- fit(log_wkflow,heart_train)

log_fit_results %>% tidy()
## # A tibble: 11 × 5
##    term        estimate std.error statistic   p.value
##    <chr>          <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)   0.143      0.198     0.723 0.470    
##  2 age          -0.0771     0.225    -0.342 0.732    
##  3 sex          -0.666      0.226    -2.95  0.00319  
##  4 cp            0.840      0.203     4.13  0.0000357
##  5 trtbps       -0.403      0.206    -1.95  0.0508   
##  6 thalachh      0.503      0.252     2.00  0.0456   
##  7 exng         -0.436      0.209    -2.09  0.0370   
##  8 oldpeak      -0.843      0.278    -3.03  0.00245  
##  9 slp           0.195      0.247     0.790 0.429    
## 10 caa          -0.773      0.214    -3.62  0.000300 
## 11 thall        -0.456      0.189    -2.41  0.0159
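Since the estimates above are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (a sketch using the fitted `log_fit_results`; note that because the recipe normalizes the predictors, each odds ratio is per one standard deviation of the predictor):

```r
# exp(estimate) converts each log-odds coefficient to an odds ratio,
# e.g. the positive cp coefficient becomes an odds ratio above 1
log_fit_results %>%
  tidy() %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(term, estimate, odds_ratio, p.value)
```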

We can use the fitted model to generate probability predictions for the training data:

predict(log_fit_results, new_data = heart_train, type = "prob")
## # A tibble: 242 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1   0.995 0.00520
##  2   0.991 0.00919
##  3   0.911 0.0888 
##  4   0.879 0.121  
##  5   0.976 0.0243 
##  6   0.306 0.694  
##  7   0.348 0.652  
##  8   0.817 0.183  
##  9   0.991 0.00942
## 10   0.933 0.0666 
## # ℹ 232 more rows

However, it’s more useful to summarize the predicted values. We can use augment() to attach the predictions to the data, then generate a confusion matrix for the heart disease outcome:

augment(log_fit_results, new_data = heart_train) %>%
  conf_mat(truth = output, estimate = .pred_class) %>%
   autoplot(type = "heatmap")

We can calculate accuracies on the training data:

log_acc_train <- augment(log_fit_results, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
log_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.851

Let’s calculate the accuracy of the logistic regression model, that is, the proportion of correct predictions it makes on the testing data. First, the confusion matrix:

augment(log_fit_results, new_data = heart_test) %>%
  conf_mat(truth = output, estimate = .pred_class) %>%
   autoplot(type = "heatmap")

We can calculate accuracies on the testing data:

log_acc_test <- augment(log_fit_results, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
log_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.852
log_roc_train <- augment(log_fit_results, new_data = heart_train) %>%
  roc_auc(output, .pred_0)
log_roc_train
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.919
log_roc_test <- augment(log_fit_results, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
log_roc_test
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.930

Let’s look at an ROC curve on the testing data:

augment(log_fit_results, new_data = heart_test) %>%
  roc_curve(output, .pred_0) %>%
  autoplot()

6.3 LDA (linear discriminant analysis)

# Specify the linear discriminant analysis model
lda_model <- discrim_linear() %>%
  set_engine("MASS") %>%
  set_mode("classification")


# Create the workflow with the LDA model
lda_workflow <- workflow() %>%
  add_model(lda_model) %>%
  add_recipe(heart_recipe)

# Fit the LDA model
lda_fit_results <- fit(lda_workflow, data = heart_train)
augment(lda_fit_results, new_data = heart_train) %>%
  conf_mat(truth = output, estimate = .pred_class) %>%
   autoplot(type = "heatmap")

We can calculate accuracies on the training data:

lda_acc_train <- augment(lda_fit_results, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
lda_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.843

We can calculate accuracies on the testing data:

lda_acc_test <- augment(lda_fit_results, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
lda_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.852

Notably, accuracy on the testing data (85.2%) is slightly higher than on the training data (84.3%), so the LDA model shows no sign of overfitting.

lda_roc_test <- augment(lda_fit_results, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
lda_roc_test
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.930

6.4 QDA (quadratic discriminant analysis)

# Specify the quadratic discriminant analysis model
qda_model <- discrim_quad() %>%
  set_engine("MASS") %>%
  set_mode("classification")

# Create the workflow with the QDA model
qda_workflow <- workflow() %>%
  add_model(qda_model) %>%
  add_recipe(heart_recipe)

qda_fit_results <- fit(qda_workflow, data = heart_train)
augment(qda_fit_results, new_data = heart_train) %>%
  conf_mat(truth = output, estimate = .pred_class) %>%
   autoplot(type = "heatmap")

We can calculate accuracies on the training data:

qda_acc_train <- augment(qda_fit_results, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
qda_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.864

We can calculate accuracies on the testing data:

qda_acc_test <- augment(qda_fit_results, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
qda_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.820

These results suggest that the model performs well on both the training and the unseen testing data. The slight decrease on the testing data (82.0% vs. 86.4%) is common, as models generally perform somewhat better on the data they were trained on.

qda_roc_test <- augment(qda_fit_results, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
qda_roc_test
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.887

6.5 Regularized Regression

en_spec_lm_class <- logistic_reg(mixture = tune(),
                         penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# penalty() is tuned on a log10 scale, so range c(0, 1) spans 10^0 = 1 to 10^1 = 10
en_grid <- grid_regular(penalty(range = c(0, 1)),
                        mixture(range = c(0, 1)),
                             levels = 10)

en_workflow_lm_class <- workflow()%>%
  add_model(en_spec_lm_class) %>%
  add_recipe(heart_recipe)
en_tune_res_class <- tune_grid(
  object = en_workflow_lm_class, 
  resamples = heart_folds, 
  grid = en_grid
)
# For elastic net logistic regression model
en_metrics_class <- en_tune_res_class %>% 
  collect_metrics()

en_auc <- en_metrics_class %>% filter(.metric == "roc_auc")

# Print AUC results
print(en_auc)
## # A tibble: 100 × 8
##    penalty mixture .metric .estimator  mean     n std_err .config               
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
##  1    1          0 roc_auc binary     0.887     5  0.0169 Preprocessor1_Model001
##  2    1.29       0 roc_auc binary     0.888     5  0.0177 Preprocessor1_Model002
##  3    1.67       0 roc_auc binary     0.888     5  0.0177 Preprocessor1_Model003
##  4    2.15       0 roc_auc binary     0.888     5  0.0177 Preprocessor1_Model004
##  5    2.78       0 roc_auc binary     0.890     5  0.0179 Preprocessor1_Model005
##  6    3.59       0 roc_auc binary     0.889     5  0.0178 Preprocessor1_Model006
##  7    4.64       0 roc_auc binary     0.889     5  0.0182 Preprocessor1_Model007
##  8    5.99       0 roc_auc binary     0.889     5  0.0183 Preprocessor1_Model008
##  9    7.74       0 roc_auc binary     0.888     5  0.0182 Preprocessor1_Model009
## 10   10          0 roc_auc binary     0.888     5  0.0182 Preprocessor1_Model010
## # ℹ 90 more rows
best_en_class <- select_best(en_tune_res_class, metric = "roc_auc")

en_wf_final_class <- finalize_workflow(en_workflow_lm_class, best_en_class)

# Fit the finalized workflow to the training set
final_fit_en_class <- fit(en_wf_final_class, data = heart_train)

We can calculate accuracies on the training data:

en_acc_train <- augment(final_fit_en_class, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
en_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.793

We can calculate accuracies on the testing data:

en_acc_test <- augment(final_fit_en_class, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
en_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.754
en_roc_test <- augment(final_fit_en_class, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
en_roc_test
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.933

6.6 Pruned Decision Trees

tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>% 
  set_mode("classification")

tree_wf <- workflow() %>% 
  add_model(tree_spec) %>% 
  add_recipe(heart_recipe)
param_grid <- grid_regular(cost_complexity(range = c(-5, -1)), levels = 10)

tune_tree <- tune_grid(
  tree_wf, 
  resamples = heart_folds, 
  grid = param_grid
)
autoplot(tune_tree)

best_complexity <- select_best(tune_tree)

tree_final <- finalize_workflow(tree_wf, best_complexity)

tree_final_fit <- fit(tree_final, data = heart_train)
best_complexity
## # A tibble: 1 × 2
##   cost_complexity .config              
##             <dbl> <chr>                
## 1         0.00001 Preprocessor1_Model01

We can extract the model fit results and view them with the rpart.plot() function:

tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

We can calculate accuracies on the training data:

tree_acc_train <- augment(tree_final_fit, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
tree_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.864

We can calculate accuracies on the testing data:

tree_acc_test <- augment(tree_final_fit, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
tree_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.787

The accuracy on the training data is approximately 86.36%, and it decreases to about 78.69% on the testing data. This drop in accuracy from training to testing suggests that the decision tree model might be overfitting to the training data or not generalizing well to new, unseen data.

tree_roc_test <- augment(tree_final_fit, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
tree_roc_test
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.827

6.7 Random Forest

rf_class_spec <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

rf_class_wf <- workflow() %>% 
  add_model(rf_class_spec) %>% 
  add_recipe(heart_recipe)
rf_grid <- grid_regular(mtry(range = c(1, 10)), 
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 5)

rf_tune_class <- tune_grid(
  rf_class_wf,
  resamples = heart_folds,
  grid = rf_grid
)
# Cache the tuning results so the grid search does not need to be re-run on each knit
save(rf_tune_class, file = "rf_tune_class.rda")
load("rf_tune_class.rda")
autoplot(rf_tune_class) + theme_minimal()

best_rf_class <- select_best(rf_tune_class)
best_rf_class
## # A tibble: 1 × 4
##    mtry trees min_n .config               
##   <int> <int> <int> <chr>                 
## 1     1   500    15 Preprocessor1_Model066
final_rf_model <- finalize_workflow(rf_class_wf, best_rf_class)
final_rf_model <- fit(final_rf_model, heart_train)
final_rf_model %>% extract_fit_parsnip() %>% 
  vip() +
  theme_minimal()

We can calculate accuracies on the training data:

rf_acc_train <- augment(final_rf_model, new_data = heart_train) %>%
  accuracy(truth = output, estimate = .pred_class)
rf_acc_train
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.913

We can calculate accuracies on the testing data:

rf_acc_test <- augment(final_rf_model, new_data = heart_test) %>%
  accuracy(truth = output, estimate = .pred_class)
rf_acc_test
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.902

The model’s accuracy is reported to be approximately 91.32% on the training data, with a slight decrease to about 90.16% on the testing data. These results indicate that the random forest model has a strong performance and generalizes well to unseen data, with only a small drop in accuracy between the training and testing sets, which suggests a good balance between bias and variance.

rf_roc_test <- augment(final_rf_model, new_data = heart_test) %>%
  roc_auc(output, .pred_0)
rf_roc_test
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.943

7. Project Conclusion

  1. Within the scope of the project, I first prepared the data set for Exploratory Data Analysis (EDA).

  2. Performed Exploratory Data Analysis (EDA).

  3. Analyzed the numerical and categorical variables.

  4. Prepared the data set for modelling.

  5. Used seven different algorithms in the modelling phase.

  6. Got 83.6% accuracy and 93.4% AUC with the k-NN model.

  7. Got 85.2% accuracy and 93.0% AUC with the Logistic Regression model.

  8. Got 85.2% accuracy and 93.0% AUC with the LDA (linear discriminant analysis) model.

  9. Got 82.0% accuracy and 88.7% AUC with the QDA (quadratic discriminant analysis) model.

  10. Got 75.4% accuracy and 93.3% AUC with the Regularized (elastic net) Regression model.

  11. Got 78.7% accuracy and 82.7% AUC with the Pruned Decision Tree model.

  12. Got 90.2% accuracy and 94.3% AUC with the Random Forest model.

  13. Evaluating all of these results, I prefer the model built with the Random Forest algorithm, which gives the best accuracy and AUC on the test set.
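The per-model test-set figures above can also be gathered programmatically into one ranked table; a sketch, assuming the `*_acc_test` tibbles computed in Section 6 are still in the environment:

```r
# Combine the stored test-set accuracies into a single comparison table,
# sorted from best to worst
bind_rows(
  knn_acc_test  %>% mutate(model = "k-NN"),
  log_acc_test  %>% mutate(model = "Logistic Regression"),
  lda_acc_test  %>% mutate(model = "LDA"),
  qda_acc_test  %>% mutate(model = "QDA"),
  en_acc_test   %>% mutate(model = "Elastic Net"),
  tree_acc_test %>% mutate(model = "Pruned Tree"),
  rf_acc_test   %>% mutate(model = "Random Forest")
) %>%
  select(model, accuracy = .estimate) %>%
  arrange(desc(accuracy))
```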